mask matrix
There are several key limitations of the MADE algorithm: (1) as mentioned in Section 3.1, the MADE algorithm can only mask neural networks such that they respect the autoregressive property; Proposition 1 formalizes this point for the non-deterministic MADE masking algorithm presented in Germain et al. [2015]. … In Section 3.1, we showed that finding the weight masks for each neural network layer is equivalent to an optimization problem that can be solved by integer programming (IP); Figure 7 provides a visual example of the steps performed by Algorithm 1. … On randomly generated adjacency structures of 15 dimensions, IP gives better objective values when the adjacency matrix is very sparse.
- North America > United States (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Netherlands (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
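Since this snippet centers on masking a network so that it respects the autoregressive property, a minimal sketch of the standard MADE mask construction from Germain et al. [2015] may help. The NumPy implementation, the `made_masks` helper name, and the layer sizes are illustrative, not the paper's code:

```python
import numpy as np

def made_masks(d_input, hidden_sizes, rng=None):
    """Binary masks that make a fully connected net autoregressive,
    following the MADE construction of Germain et al. (2015)."""
    rng = np.random.default_rng(rng)
    degrees = [np.arange(1, d_input + 1)]            # input degrees 1..D
    for h in hidden_sizes:
        low = degrees[-1].min()
        # hidden degrees drawn from [low, D-1]; keeping them >= the
        # previous layer's minimum avoids dead units
        degrees.append(rng.integers(low, d_input, size=h))
    masks = []
    for m_prev, m_next in zip(degrees[:-1], degrees[1:]):
        # unit k may see unit j iff degree(k) >= degree(j)
        masks.append((m_next[:, None] >= m_prev[None, :]).astype(np.float32))
    # output d may only depend on inputs with degree < d: strict inequality
    m_out = np.arange(1, d_input + 1)
    masks.append((m_out[:, None] > degrees[-1][None, :]).astype(np.float32))
    return masks

masks = made_masks(d_input=5, hidden_sizes=[16, 16], rng=0)
print([m.shape for m in masks])   # [(16, 5), (16, 16), (5, 16)]
```

The strict inequality on the output layer is what enforces the autoregressive property: output d sees only inputs 1..d-1.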
Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction: A Pruning Framework for LLMs
Ma, Zuxin, Cui, Yunhe, Qin, Yongbin
Non-uniform structured network pruning methods can effectively reduce Large Language Model (LLM) size by eliminating redundant channels or layers, offering lower performance degradation than uniform strategies. However, existing non-uniform methods rely heavily on manually designed pruning policies (e.g., layer importance and scaling factors) and therefore cannot efficiently adapt to scenarios with dynamic pruning-ratio requirements. Additionally, a critical bottleneck -- the time-consuming evaluation of pruning policies -- further limits the feasibility of iteratively and dynamically finding optimal pruning policies. To address these limitations, we propose PPF (Predictive Pruning Framework), a novel pruning framework for LLMs that eliminates manual design dependencies via second-level performance prediction. PPF not only supports real-time pruning decisions under dynamic pruning ratios but is also applicable to static pruning scenarios. It employs an agent to produce adaptive, real-time pruning actions, together with a lightweight performance predictor that can evaluate a pruning policy in seconds, significantly speeding up the iterative optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can generate dynamic and static pruning policies, reducing perplexity by up to 33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods and outperforming manually designed pruning policies. The performance predictor achieves second-level performance prediction with high accuracy (prediction error < 0.0011) and reduces mean evaluation latency from minute-level (1 minute 38.02 seconds for test-set evaluation) to second-level (1.52 seconds), a speedup of over 64x. Our code will be available at https://github.com/Ma-zx/PPF.
- Europe > Austria > Vienna (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > British Columbia > Vancouver (0.04)
- (10 more...)
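The abstract's central engineering move is replacing minute-level test-set evaluation with a learned predictor. Below is a hedged sketch of what such a predictor could look like, assuming (purely for illustration) that a pruning policy is encoded as a per-layer pruning-ratio vector and the predictor is a small MLP regressing perplexity; PPF's actual inputs, architecture, and training data are not specified in the abstract, and the policies and labels below are synthetic:

```python
import torch
import torch.nn as nn

class PruningPerformancePredictor(nn.Module):
    """Hypothetical stand-in for PPF's lightweight predictor: maps a
    per-layer pruning-ratio vector to a predicted perplexity."""
    def __init__(self, num_layers: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_layers, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, ratios: torch.Tensor) -> torch.Tensor:
        return self.net(ratios).squeeze(-1)

# Training sketch on synthetic (policy, perplexity) pairs.
predictor = PruningPerformancePredictor(num_layers=32)  # e.g. Llama2-7B depth
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
policies = torch.rand(256, 32)            # fake pruning policies
ppl = 5.0 + 20.0 * policies.mean(dim=1)   # fake labels: more pruning, worse ppl
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(predictor(policies), ppl)
    loss.backward()
    opt.step()
```

Once trained, a forward pass costs milliseconds, which is how an agent could iterate over candidate policies without paying for a full test-set evaluation each time.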
Pruning for Sparse Diffusion Models based on Gradient Flow
Wan, Ben, Zheng, Tianyi, Chen, Zhaoyu, Wang, Yuxiao, Wang, Jia
Diffusion Models (DMs) have impressive capabilities among generative models but are limited by slower inference speeds and higher computational costs. Previous works use one-shot structured pruning to derive lightweight DMs from pre-trained ones, but this approach often leads to a significant drop in generation quality and may remove crucial weights. We therefore propose an iterative pruning method based on gradient flow, comprising a gradient-flow pruning process and a gradient-flow pruning criterion. We employ a progressive soft pruning strategy to maintain the continuity of the mask matrix and guide it along the gradient flow of the energy function, based on the pruning criterion in sparse space, thereby avoiding the sudden information loss typically caused by one-shot pruning. The gradient-flow-based criterion prunes parameters whose removal increases the gradient norm of the loss function, enabling fast convergence of the pruned model during the iterative pruning stage. Extensive experiments on widely used datasets demonstrate that our method achieves superior performance in efficiency and consistency with pre-trained models.
- Asia > China > Shanghai > Shanghai (0.06)
- Europe > Switzerland (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
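To make "progressive soft pruning" concrete, here is a minimal sketch in which a continuous mask is decayed, rather than zeroed, on low-scoring weights while the sparsity target ramps up over the schedule. The |w * grad| saliency is a generic first-order stand-in, not the paper's gradient-flow criterion, and all names and schedules are illustrative:

```python
import torch

def soft_prune_step(weight, grad, mask, target_sparsity, step, total_steps,
                    decay=0.9):
    """One progressive soft-pruning step: the mask stays continuous in
    [0, 1] and shrinks gradually on low-saliency weights, avoiding the
    sudden information loss of one-shot (hard) pruning."""
    # Sparsity target ramps up linearly over the iterative schedule.
    current_sparsity = target_sparsity * (step + 1) / total_steps
    score = (weight * grad).abs() * mask          # masked first-order saliency
    k = int(current_sparsity * weight.numel())
    if k > 0:
        threshold = score.flatten().kthvalue(k).values
        mask = torch.where(score <= threshold, mask * decay, mask)
    return mask

# Usage sketch on a toy weight tensor.
w = torch.randn(128, 128, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
mask = torch.ones_like(w)
for step in range(10):
    mask = soft_prune_step(w.detach(), w.grad, mask,
                           target_sparsity=0.5, step=step, total_steps=10)
```

Because the mask is multiplied by a decay factor instead of being set to zero, masked weights can still carry (shrinking) signal between iterations, which is the continuity property the abstract emphasizes.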
TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text
Lu, Songshuo, Wang, Hua, Rong, Yutian, Chen, Zhi, Tang, Yaohua
Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks during prefill, which requires a large volume of computation and therefore leads to significant time-to-first-token (TTFT) latency. To reduce the computation overhead as well as TTFT, we introduce TurboRAG, a novel RAG system that redesigns the inference paradigm of current RAG systems by pre-computing and storing the key-value (KV) caches of documents offline, then directly retrieving the saved KV caches for prefill. Online computation of KV caches is thus eliminated during inference. In addition, we provide a number of insights into the mask matrix and positional embedding mechanisms, and fine-tune a pretrained language model to maintain TurboRAG's accuracy. Our approach is applicable to most existing large language models and their applications without requiring any modification of models or inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to conventional RAG systems (8.6x on average) while preserving performance comparable to standard RAG systems. Retrieval-augmented generation has emerged as a promising direction for alleviating some of the challenges faced by large models (LMs), e.g., hallucinations (Mallen et al., 2023; Khandelwal et al., 2020; Izacard et al., 2022). As shown in Figure 1a, the large-scale documents in these systems are typically segmented into a myriad of short chunks that can be embedded for retrieval.
- North America > United States > Texas (0.04)
- North America > United States > Oklahoma (0.04)
- North America > United States > New York (0.04)
- (5 more...)
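The paradigm shift described above (prefill each chunk once offline, splice caches online) can be sketched as follows. This assumes the classic Hugging Face `past_key_values` layout, a tuple of per-layer `(key, value)` tensors shaped `[batch, heads, seq, head_dim]` (newer `Cache` objects differ), and it deliberately omits the positional-embedding and attention-mask adjustments that the paper identifies as the hard part:

```python
import torch

def precompute_chunk_kv(model, tokenizer, chunk: str):
    """Offline: run prefill once on a document chunk and store its KV cache
    so it never has to be recomputed at query time."""
    ids = tokenizer(chunk, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    return out.past_key_values

def concat_kv(caches):
    """Online: splice the stored chunk caches along the sequence axis
    (dim=2) to stand in for a full prefill over the concatenated chunks.
    TurboRAG additionally rewrites positions and the mask matrix, which
    this sketch omits."""
    return tuple(
        (torch.cat([c[layer][0] for c in caches], dim=2),
         torch.cat([c[layer][1] for c in caches], dim=2))
        for layer in range(len(caches[0]))
    )
```

Under this scheme, online TTFT is dominated by decoding the query tokens against the spliced cache rather than re-prefilling every retrieved chunk, which is where the reported up-to-9.4x TTFT reduction would come from.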
MSTF: Multiscale Transformer for Incomplete Trajectory Prediction
Liu, Zhanwen, Li, Chao, Yang, Nan, Wang, Yang, Ma, Jiaqi, Cheng, Guangliang, Zhao, Xiangmo
Motion forecasting plays a pivotal role in autonomous driving systems, enabling vehicles to execute collision warnings and rational local-path planning based on predictions of surrounding vehicles. However, prevalent methods often assume complete observed trajectories, neglecting the potential impact of missing values induced by object occlusion, scope limitation, and sensor failures. Such oversights inevitably compromise the accuracy of trajectory predictions. To tackle this challenge, we propose an end-to-end framework, termed Multiscale Transformer (MSTF), meticulously crafted for incomplete trajectory prediction. MSTF integrates a Multiscale Attention Head (MAH) and an Information Increment-based Pattern Adaptive (IIPA) module. Specifically, the MAH component uses a multi-head attention mechanism to concurrently capture multiscale motion representations of the trajectory sequence at various temporal granularities, facilitating the modeling of global dependencies in motion across different scales and thereby mitigating the adverse effects of missing values. Additionally, the IIPA module adaptively extracts a continuity representation of motion across time steps by analyzing missing patterns in the data. The continuity representation delineates the motion trend at a higher level, guiding MSTF to generate predictions consistent with motion continuity. We evaluate MSTF on two large-scale real-world datasets. Experimental results demonstrate that MSTF surpasses state-of-the-art (SOTA) models on incomplete trajectory prediction, showcasing its efficacy in addressing the challenges posed by missing values in motion forecasting for autonomous driving systems.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- Europe > United Kingdom > England > Merseyside > Liverpool (0.04)
- (2 more...)
- Transportation > Ground > Road (1.00)
- Information Technology (1.00)
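As one way to picture the Multiscale Attention Head, the sketch below average-pools a trajectory to several temporal granularities, runs self-attention at each scale, and fuses the per-scale summaries. The scales, pooling choice, and fusion layer are assumptions made for illustration; the abstract does not specify MAH's internals:

```python
import torch
import torch.nn as nn

class MultiscaleAttentionHead(nn.Module):
    """Illustrative multiscale attention: attend over the trajectory at
    several temporal granularities so that coarse scales can bridge gaps
    left by missing values at the fine scale."""
    def __init__(self, d_model=64, scales=(1, 2, 4), n_heads=4):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in scales
        )
        self.fuse = nn.Linear(d_model * len(scales), d_model)

    def forward(self, x):                        # x: [batch, time, d_model]
        outs = []
        for scale, attn in zip(self.scales, self.attn):
            xs = nn.functional.avg_pool1d(       # coarsen the time axis
                x.transpose(1, 2), kernel_size=scale, stride=scale
            ).transpose(1, 2)
            h, _ = attn(xs, xs, xs)              # self-attention at this scale
            outs.append(h.mean(dim=1))           # per-scale summary
        return self.fuse(torch.cat(outs, dim=-1))

mah = MultiscaleAttentionHead()
traj = torch.randn(8, 32, 64)   # 8 agents, 32 time steps, 64-dim features
print(mah(traj).shape)          # torch.Size([8, 64])
```

Pooling before attending is one simple way to realize "various temporal granularities": at scale 4, a gap of a few missing frames is averaged into its neighbors, so global dependencies can still be modeled across it.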